Applied Generative AI for AI Developers
RAG = Retrieval-Augmented Generation
A generative AI approach where the model combines external knowledge retrieval with text generation to provide more accurate and contextually rich responses.
Augments LLM responses with relevant context: Instead of relying solely on the LLM’s training data, RAG retrieves and incorporates specific, up-to-date information into responses.
Helps ground responses in factual information: By providing relevant context from trusted sources, RAG anchors responses in actual source material rather than relying on the model's parametric memory alone.
Reduces hallucinations: With access to specific, retrieved information, the model is less likely to generate incorrect or fabricated responses.
Enables use of private/proprietary data: Organizations can leverage their internal documents, knowledge bases, and proprietary information that wasn’t part of the LLM’s training data.
Provides source attribution: RAG systems can track where information comes from, making responses more transparent and verifiable.
Key Components:
Prepare documents: Clean and preprocess your source documents, removing irrelevant content and standardizing format.
Create embeddings: Convert text chunks into numerical vectors using embedding models like BGE-large-en-v1.5 (available on Hugging Face), Amazon Titan embeddings, OpenAI’s ada-002 or Cohere’s embed-multilingual.
Store in vector database: Upload embeddings to a vector store like Pinecone, Weaviate, or FAISS for efficient similarity search.
Process user query: Convert the user’s question into an embedding using the same embedding model.
Retrieve relevant context: Perform similarity search to find the most relevant document chunks.
Generate response: Combine retrieved context with an LLM prompt to generate an accurate, contextual response.
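The six steps above can be sketched end-to-end in a few lines. This is a minimal illustration only: the `embed` function below is a hashed bag-of-words stand-in for a real embedding model (BGE, Titan, ada-002, ...), and the chunks and query are made-up toy data.

```python
import zlib
import numpy as np

# Toy stand-in for a real embedding model: a hashed bag-of-words vector.
# Production systems use learned dense embeddings instead.
def embed(text: str, dim: int = 512) -> np.ndarray:
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[zlib.crc32(tok.encode()) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

# Steps 1-3: prepare documents, embed them, store in an in-memory "index"
chunks = [
    "Niels Bohr developed a model of the atom.",
    "RAG augments an LLM prompt with retrieved context.",
    "The NYC taxi dataset contains yellow-cab trip records.",
]
index = np.stack([embed(c) for c in chunks])

# Step 4: embed the user query with the same model
query = "How does RAG use retrieved context?"
q = embed(query)

# Step 5: cosine-similarity search (vectors are already unit-normalized)
best = int(np.argmax(index @ q))
context = chunks[best]

# Step 6: assemble the augmented prompt to send to the LLM
prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"
print(context)
```

In a real system the in-memory matrix is replaced by a vector database and the final `prompt` is sent to an LLM; everything else keeps the same shape.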
Reference: Chunking techniques with LangChain and LlamaIndex
Document segmentation approaches: Choose between fixed-size chunks, semantic chunking, or paragraph-based splitting depending on your content structure.
Chunk size considerations: Balance between too large (dilutes relevance) and too small (loses context) - typically 256-1024 tokens works well.
Overlap between chunks: Include some overlap (10-20%) between consecutive chunks to maintain context across boundaries.
Maintaining context: Preserve important metadata and hierarchical information when splitting documents.
Structured vs unstructured data: Adapt chunking strategy based on whether you’re dealing with free text, tables, or structured documents.
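The fixed-size-with-overlap strategy above can be sketched in a few lines of plain Python (the sizes here are arbitrary character counts, not the token-based splitters LangChain or LlamaIndex provide):

```python
def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size character chunks with 20% overlap."""
    step = size - overlap
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + size])
        if i + size >= len(text):
            break
    return chunks

doc = "".join(str(i % 10) for i in range(500))
parts = chunk_text(doc)
# consecutive chunks share their 40-character boundary region
print(len(parts), parts[0][-40:] == parts[1][:40])
```

The overlap is what preserves context across chunk boundaries: a sentence that straddles two chunks appears whole in at least one of them.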
Key Considerations:
Model selection criteria: Consider factors like accuracy, speed, cost, and dimension size when choosing an embedding model.
Dimensionality impact: Higher dimensions can capture more information but increase storage costs and retrieval time.
Multi-lingual support: Choose models like Cohere multilingual or Amazon Titan if your application needs to handle multiple languages.
Domain-specific needs: Consider fine-tuning embedding models for specialized domains like medical or legal text, e.g. fine-tuning with Sentence Transformers.
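The storage side of the dimensionality trade-off is easy to quantify. A rough back-of-the-envelope sketch, assuming float32 vectors and ignoring index overhead (1536 is the dimension of OpenAI's ada-002; 384 is typical of small sentence-transformer models):

```python
def index_size_bytes(num_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    """Raw storage for a flat float32 vector index (no ANN overhead)."""
    return num_vectors * dim * bytes_per_float

# One million chunks: 384-dim model vs 1536-dim model
small = index_size_bytes(1_000_000, 384)
large = index_size_bytes(1_000_000, 1536)
print(f"{small / 2**30:.2f} GiB vs {large / 2**30:.2f} GiB "
      f"({large // small}x the storage)")
```

Quadrupling the dimension quadruples storage (and roughly the per-query distance cost), which is why smaller models are often preferred when their retrieval accuracy is adequate.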
Features to Consider:
Scalability: Ability to handle millions or billions of vectors efficiently.
Query performance: Fast similarity search with support for approximate nearest neighbors (ANN) algorithms.
Similarity search algorithms: Support for different distance metrics (cosine, Euclidean) and indexing methods.
Metadata filtering: Ability to combine vector similarity search with metadata filters.
Cost considerations: Balance between hosting costs, query costs, and storage requirements.
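A minimal sketch of what a vector store does internally: exact cosine-similarity search combined with a metadata filter. The data and `filter_fn` interface here are illustrative, not the API of any particular vector database (real stores use ANN indexes rather than the brute-force scan shown):

```python
import numpy as np

def search(vectors, metadata, query, k=2, filter_fn=None):
    """Exact cosine-similarity search with optional metadata filtering."""
    sims = vectors @ query / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    ranked = np.argsort(-sims)  # best match first
    hits = [(int(i), float(sims[i])) for i in ranked
            if filter_fn is None or filter_fn(metadata[i])]
    return hits[:k]

vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])
metadata = [{"lang": "en"}, {"lang": "fr"}, {"lang": "en"}]
hits = search(vectors, metadata, np.array([1.0, 0.0]),
              filter_fn=lambda m: m["lang"] == "en")
print(hits)
```

Note the filter runs alongside the similarity ranking: the French document is excluded even though its vector is indexed, which is the "metadata filtering" feature listed above.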
| Feature | Vector DB RAG | Graph RAG |
|---|---|---|
| Storage | Dense embeddings (vectors) | Nodes & relationships |
| Retrieval | Nearest neighbor search | Graph traversal queries |
| Scalability | Efficient for large text | More complex, depends on structure |
| Context | Semantic similarity only | Rich, structured context |
| Use Case | Unstructured knowledge | Structured reasoning |
Example: consider a knowledge graph of physicists in which Niels Bohr is stored as a node with id `"Bohr"`:

```cypher
MATCH (p:Person {id: "Bohr"})-[:COLLABORATED_WITH]->(collaborator)
RETURN collaborator.id

MATCH (p:Person {id: "Bohr"})-[:STUDY_UNDER*2]->(mentor)
RETURN mentor.id
```

A vector search for this would potentially have to include several chunks of text and may still not get all the collaborators, whereas the graph retrieval is deterministic and more accurate.
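The same traversals can be mimicked in plain Python over a toy adjacency map, which makes the determinism concrete. The graph data below is illustrative, not from the source:

```python
# Illustrative toy graph mirroring the Cypher patterns:
# node -> {relationship -> [neighbor nodes]}
graph = {
    "Bohr": {"COLLABORATED_WITH": ["Heisenberg", "Pauli"],
             "STUDY_UNDER": ["Thomson"]},
    "Thomson": {"STUDY_UNDER": ["Rayleigh"]},
}

def traverse(start: str, rel: str, hops: int = 1) -> list[str]:
    """Follow `rel` edges `hops` times from `start` (like [:REL*hops])."""
    frontier = [start]
    for _ in range(hops):
        frontier = [dst for node in frontier
                    for dst in graph.get(node, {}).get(rel, [])]
    return frontier

print(traverse("Bohr", "COLLABORATED_WITH"))    # every collaborator, exactly
print(traverse("Bohr", "STUDY_UNDER", hops=2))  # mentor-of-mentor
```

Unlike a similarity search, the traversal returns exactly the set of nodes reachable via the named relationship, with no relevance threshold to tune.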
The equivalent vector retrieval in LangChain:

```python
retriever = vectorstore.as_retriever()
retriever.get_relevant_documents("Who all did Bohr collaborate with?")
```

RAG can also target structured data by translating natural-language questions into SQL. An example query over the NYC taxi dataset:

```sql
SELECT AVG(trip_distance) AS avg_trip_distance
FROM nyc_taxi_data
WHERE DATE(tpep_pickup_datetime) = '2024-12-11';
```

Generating and running such queries with LangChain and Amazon Bedrock:

```python
from langchain_community.utilities import SQLDatabase
from langchain_experimental.sql import SQLDatabaseChain
from langchain_aws import ChatBedrockConverse
import boto3

# Initialize Bedrock client
bedrock = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-east-1'  # replace with your region
)

# Initialize the LLM
llm = ChatBedrockConverse(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",  # or your preferred Claude model
    client=bedrock,
    temperature=0  # ChatBedrockConverse takes sampling params directly
)

# Connect to database
db = SQLDatabase.from_uri("sqlite:///example.db")

# Create the chain
sql_chain = SQLDatabaseChain.from_llm(llm=llm, database=db, verbose=True)

# Run the query
sql_chain.run("What are the top 5 research topics?")
```

Evaluation metrics for multi-modal RAG:

| Metric | Description |
|---|---|
| Cross-Modal Relevance | Alignment between retrieved items across modalities |
| Response Coherence | Integration of multi-modal information in outputs |
| Retrieval Latency | Time to fetch and process multi-modal context |
| Memory Usage | Resource requirements for different modalities |
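Of these metrics, retrieval latency is the simplest to instrument. A minimal sketch, where `retrieve` is a placeholder stub standing in for a real multi-modal retriever:

```python
import time

def measure_latency(retrieve_fn, query, runs=20):
    """Average wall-clock seconds per retrieval call."""
    start = time.perf_counter()
    for _ in range(runs):
        retrieve_fn(query)
    return (time.perf_counter() - start) / runs

def retrieve(query):
    # Placeholder retriever; a real one would hit the vector store
    return ["chunk-1", "chunk-2"]

avg_s = measure_latency(retrieve, "example query")
print(f"{avg_s * 1e6:.1f} µs/query")
```

Averaging over several runs smooths out scheduler noise; for production monitoring you would typically track percentiles (p50/p95) rather than the mean.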